    Domain-specific language models and lexicons for tagging

    Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve the desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a fairly small domain-specific corpus to a large general-English one boosts accuracy from 87% to over 92% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy on a new domain at a relatively small cost.

    Overview of BioCreative II gene mention recognition.

    Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task, participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of methods were used and the results varied, with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.

    Multi-document Summarization by Visualizing Topical Content

    This paper describes a framework for multidocument summarization which combines three premises: coherent themes can be identified reliably; highly representative themes, running across subsets of the document collection, can function as multi-document summary surrogates; and effective end-use of such themes should be facilitated by a visualization environment which clarifies the relationship between themes and documents. We present algorithms that formalize our framework, describe an implementation, and demonstrate a prototype system and interface.

    1 Introduction: multi-document summarization as an enabling technology for IR

    The rapid growth of electronic documents has created a great demand for a navigation tool to traverse a large corpus. Information retrieval (IR) technologies allow us to access the documents presumably matching our interests. However, a traditional hit list-based architecture, which returns linearly organized single document summaries, no longer suffices, given..

    Domain-specific language models and lexicons for tagging

    been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P

    Cadmium toxicokinetics and bioaccumulation in turtles: trophic exposure of Trachemys scripta elegans

    Ecotoxicological data in reptiles are mainly represented by field studies reporting tissue burdens of wild-captured individuals, but much less is known about the processes of uptake, depuration, accumulation and effects of inorganic contaminants in these species. In this study, female Trachemys scripta elegans were exposed to cadmium (Cd) through a CdCl2-supplemented diet at increasing environmentally relevant concentrations for 13 weeks and then went through a 3-week decontamination phase during which they were fed uncontaminated food. Blood and feces were collected during the three phases of the experiment; the turtles were sacrificed at the end of the experiment and organ samples were collected. Cd concentrations in blood remained stable over the course of the experiment, while Cd concentrations in feces increased with time and with the amount of Cd ingested. Assimilation efficiency in liver and kidney together was low (0.7–6.1%) but did occur, and Cd accumulated in the organs in a dose-dependent manner in the following order of concentrations: kidney > liver > pancreas > muscle. In terms of organ burden, the Cd burden was highest in liver, followed by kidney and pancreas. Assimilation efficiency decreased as the amount of Cd ingested increased, suggesting that at higher doses Cd absorption decreased and/or depuration increased. The mineral content of the liver changed with Cd level: zinc and iron concentrations increased with increasing Cd levels. Cd accumulation had no effect on survival, food consumption, growth, weight or length, suggesting that the treatment did not affect the body condition of the females.